uri <- paste0(github_ames, "AmesHousing.csv")
df = read.csv(uri) # data.frame
dt <- fread(uri) # data.table
In the Ames Housing dataset, which is commonly used for predicting housing prices, there are several
features that can significantly influence the sales price of a house. The importance of these features
can vary depending on the specific dataset and the machine learning algorithm used for analysis.
However, based on general observations and common practices,
the following features are often considered as strong predictors of housing prices:
1. Overall Quality: The overall quality of a house, usually measured on a scale from 1 to 10, is a crucial factor affecting its sales price. Higher-quality homes tend to command higher prices.
2. Above Ground Living Area: The size of the above ground living area, typically measured in square feet, is a strong indicator of a house’s value. Larger houses generally have higher prices.
3. Number of Bedrooms: The number of bedrooms in a house is an important factor for many buyers. Houses with more bedrooms are typically priced higher.
4. Number of Bathrooms: Similarly, the number of bathrooms in a house plays a significant role in determining its value. More bathrooms often lead to higher prices.
5. Lot Size: The size of the lot on which a house is situated can influence its price. Larger lots are generally associated with higher prices, especially in desirable locations.
6. Neighborhood: The neighborhood in which a house is located can have a significant impact on its value.
The Ames Housing dataset contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, Iowa (IA) from 2006 to 2010.
The dataset has 2,930 observations with 82 variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous). For a complete description of all included variables, please look at: https://rdrr.io/cran/AmesHousing/man/ames_raw.html.
Familiarize yourself with the data.
Provide a table with descriptive statistics for all included variables and check:
Classes of each of the variables (e.g. factors or continuous variables).
Descriptive/summary statistics for all continuous variables (e.g. mean, SD, range) and factor variables (e.g. frequencies).
Explore missing values: sapply(df, function(x) sum(is.na(x)))
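As a minimal sketch of the missing-value check, the snippet below applies the same sapply() pattern to a small made-up data frame and keeps only the columns that actually contain missing values, largest count first. The object toy_df and its columns are hypothetical and not part of the Ames data.

```r
# Hypothetical toy data frame standing in for the Ames data
toy_df <- data.frame(
  price   = c(100, 200, NA, 400),
  alley   = c(NA, NA, "Grvl", NA),
  quality = c(5, 7, 6, 8),
  stringsAsFactors = FALSE
)

# Count the missing values per column
na_counts <- sapply(toy_df, function(x) sum(is.na(x)))

# Keep only columns with at least one NA, largest count first
na_counts <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_counts)
```

Sorting the counts makes it easier to see which variables are mostly empty and which only miss a handful of values.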
dt %>%
setcolorder(c("Order", "SalePrice")) %>%
DT::datatable(
caption = "Table 1: Ames Housing dataset",
class = "compact stripe",
rownames = FALSE,
filter = 'top',
extensions = c('FixedColumns'),
options = list(
scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
)
) %>%
formatCurrency("SalePrice", '\U0024', digits = 0) %>%
formatStyle(
'SalePrice',
color = "#003700",
fontWeight = "bold",
backgroundColor = '#FFFFF0',
backgroundSize = '100% 60%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center'
) %>%
formatStyle(
'Order',
color = '#C0C0C0',
backgroundColor = '#FFFFF0'
)
str (no package needed)
describe function (from the psych-package) for continuous variables
table function (base-R) for factor variables.
# To check the structure of the data, you can use the "str"-command:
# str(dt)
# create a table with the type of the data
dt_str <-
dt[, lapply(.SD, typeof)] %>%
melt.data.table(
measure.vars = names(.),
variable.factor = FALSE) %>%
setorder(value, variable )
# display a summary per type
dt_str %>%
.[, .(count = .N), by = value] %>%
DT::datatable(
caption = "Table 2: Data structure summary",
class = "compact stripe",
rownames = FALSE,
options = list(
dom = "t"
)
) %>%
formatStyle(
"value",
color = "#370037",
backgroundColor = "#FFFFF0",
fontWeight = "bold"
)
# display structure/type of the data
dt_str %>%
DT::datatable(
caption = "Table 3: Data structure and types",
class = "compact stripe",
rownames = FALSE,
filter = "top"
) %>%
formatStyle(
"variable",
color = "#370037",
backgroundColor = "#FFFFF0",
fontWeight = "bold"
)
dt_chr <- dt_str[value == "character", variable]
dt_int <- dt_str[value == "integer", variable]
All factor variables now have the ‘character’ class.
The following code helps to convert each character variable into a factor variable:
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
# str(df)
# convert character variables to factor variables
chr2fct <- function(x){
if(is.character(x))
as.factor(x)
else
x
}
# convert character variables to factor variables
# keep the integers
dt[, names(dt):= lapply(.SD, chr2fct)]
# display the factors and levels
str(dt[, ..dt_chr])
Classes 'data.table' and 'data.frame': 2930 obs. of 43 variables:
$ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
$ Bldg Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
$ Bsmt Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 4 6 6 6 6 6 6 6 6 6 ...
$ Bsmt Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 3 5 5 5 5 5 4 5 5 5 ...
$ Bsmt Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 4 6 4 4 4 6 ...
$ BsmtFin Type 1: Factor w/ 7 levels "","ALQ","BLQ",..: 3 6 2 2 4 4 4 2 4 7 ...
$ BsmtFin Type 2: Factor w/ 7 levels "","ALQ","BLQ",..: 7 5 7 7 7 7 7 7 7 7 ...
$ Central Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
$ Condition 1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
$ Condition 2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Exter Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Exter Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 3 4 4 3 3 3 4 ...
$ Exterior 1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
$ Exterior 2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
$ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA 3 NA NA 3 NA NA NA NA NA ...
$ Fireplace Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA 5 5 3 NA NA 5 5 ...
$ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
$ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
$ Garage Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Garage Finish : Factor w/ 4 levels "","Fin","RFn",..: 2 4 4 2 2 2 2 3 3 2 ...
$ Garage Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Garage Type : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Heating QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 2 5 5 1 3 1 1 1 1 3 ...
$ House Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
$ Kitchen Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 3 1 5 3 3 3 3 3 ...
$ Land Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
$ Land Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
$ Lot Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
$ Lot Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 1 4 1 4 1 1 4 1 1 4 ...
$ MS Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
$ Mas Vnr Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 6 5 3 5 5 3 5 5 5 5 ...
$ Misc Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA 2 NA NA NA NA NA NA NA ...
$ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 25 25 9 ...
$ Paved Drive : Factor w/ 3 levels "N","P","Y": 2 3 3 3 3 3 3 3 3 3 ...
$ Pool QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
$ Roof Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Roof Style : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
$ Sale Condition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Sale Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
Create a table with the number of missing values per variable.
# sapply(df, function(x) sum(is.na(x)))
# table of missing values per variable
f_kbl_with_NA(dt) %>%
kable_styling(
full_width = FALSE,
position = "left",
htmltable_class = "lighttable-hover lighttable-condensed lightable-striped"
)
| variable | value |
|---|---|
| Pool QC | 2917 |
| Misc Feature | 2824 |
| Alley | 2732 |
| Fence | 2358 |
| Fireplace Qu | 1422 |
| Lot Frontage | 490 |
| Garage Yr Blt | 159 |
| Garage Qual | 158 |
| Garage Cond | 158 |
| Garage Type | 157 |
| Garage Finish | 157 |
| Bsmt Qual | 79 |
| Bsmt Cond | 79 |
| Bsmt Exposure | 79 |
| BsmtFin Type 1 | 79 |
| BsmtFin Type 2 | 79 |
| Mas Vnr Area | 23 |
| Bsmt Full Bath | 2 |
| Bsmt Half Bath | 2 |
| BsmtFin SF 1 | 1 |
| BsmtFin SF 2 | 1 |
| Bsmt Unf SF | 1 |
| Total Bsmt SF | 1 |
| Garage Cars | 1 |
| Garage Area | 1 |
Create a table with descriptive statistics for all included variables.
For continuous variables, you can use the describe function (from the psych-package).
For factor variables, you can use the table function (base-R).
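A base-R sketch of the same idea, for readers without the psych package: it computes mean, SD, minimum and maximum for every numeric column and a frequency table for every factor column. The built-in iris data is used here only as a stand-in for the Ames data (an assumption made to keep the example self-contained).

```r
# Built-in iris data used as a stand-in for the Ames data (assumption)
num_cols <- names(Filter(is.numeric, iris))
fct_cols <- names(Filter(is.factor, iris))

# Mean, SD, minimum and maximum for each numeric column
num_stats <- t(sapply(iris[num_cols], function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
}))
print(round(num_stats, 2))

# Frequency table for each factor column
fct_freq <- lapply(iris[fct_cols], table)
print(fct_freq)
```

psych::describe() reports more statistics (trimmed mean, MAD, skew, kurtosis), so this is only a reduced version of what the exercise asks for.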
dt[, psych::describe(.SD), .SDcols = dt_int] %>%
as.data.table(keep.rownames = "cont_vars") %>%
DT::datatable(
caption = "Table 4: Describe numerics",
class = "stripe",
rownames = FALSE,
filter = "top",
extensions = c('FixedColumns'),
options = list(
scrollX = TRUE,
fixedColumns = list(leftColumns = 1)
)
) %>%
formatStyle(
"cont_vars",
color = "#370037",
backgroundColor = "#FFFFF0",
fontWeight = "bold"
)
my_cnt <-
function(x){
data.table(col = x) %>%
.[, .(cnt = .N), by = col]
}
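# dt <- dtb # (dtb is not defined in this section)
# Make sure all character columns are factors (does nothing if they already are)
chr_cols <- names(which(sapply(dt, is.character)))
if (length(chr_cols) > 0) dt[, (chr_cols) := lapply(.SD, as.factor), .SDcols = chr_cols]
# Reshape the factor columns into long format
cols <- names(which(sapply(dt, is.factor)))
dt6 <- dt[, ..cols]
dt_long <- melt(dt6, measure.vars = names(dt6), variable.name = "Column")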
# Create bar chart for each column
ggplot(dt_long, aes(x = fct_infreq(value))) +
geom_bar() +
facet_wrap(~Column, scales = "free_x") +
labs(x = "Value", y = "Count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
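# Alternative: count the values per factor column with my_cnt and plot the counts
# (uses dt6, the factor-only table created above)
tst <- rbindlist(lapply(dt6, my_cnt), idcol = "c")
ggplot(tst, mapping = aes(x = col, y = cnt)) +
geom_col() +
coord_flip() +
facet_wrap(facets = vars(c), scales = "free")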
temp <-
df %>%
purrr::keep(is.factor)
for (i in 1:ncol(temp)) {
print(names(temp[i]))
print(table(temp[, i]))
}
There are several missing values in the dataset, which need to be tackled before we can proceed with the rest of the analysis.
There are many ways to impute missing values, but for now, impute missing values for numeric variables with the median, and impute missing values in all factor variables with the label “100”.
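The two imputation rules can be sketched on a tiny made-up data frame (toy_df and its columns are hypothetical). The factor column is converted to character before the label is filled in; calling ifelse() directly on a factor would return the underlying integer codes instead of the labels.

```r
# Hypothetical toy data; column names are illustrative only
toy_df <- data.frame(
  area  = c(1200, NA, 1500, 1800),
  fence = factor(c("GdPrv", NA, "MnPrv", NA))
)

# Numeric column: replace NA with the median of the observed values
toy_df$area[is.na(toy_df$area)] <- median(toy_df$area, na.rm = TRUE)

# Factor column: go through character so that the labels, not the
# underlying integer codes, are kept; then turn it back into a factor
fence_chr <- as.character(toy_df$fence)
fence_chr[is.na(fence_chr)] <- "100"
toy_df$fence <- factor(fence_chr)

print(toy_df)
```

After this step the toy data has no missing values, and “100” has become an ordinary factor level next to the original labels.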
# impute NA with median for all numeric variables
dt[, (dt_int) := lapply(.SD, function(x){
ifelse(is.na(x), median(x, na.rm=T), x)}), .SDcols = dt_int]
# table of missing values per variable
f_kbl_with_NA(dt)
| variable | value |
|---|---|
| Pool QC | 2917 |
| Misc Feature | 2824 |
| Alley | 2732 |
| Fence | 2358 |
| Fireplace Qu | 1422 |
| Garage Qual | 158 |
| Garage Cond | 158 |
| Garage Type | 157 |
| Garage Finish | 157 |
| Bsmt Qual | 79 |
| Bsmt Cond | 79 |
| Bsmt Exposure | 79 |
| BsmtFin Type 1 | 79 |
| BsmtFin Type 2 | 79 |
df <-
lapply(df, function(x) {
### Impute median for all missing numeric values
if(is.numeric(x)) ifelse(is.na(x), median(x, na.rm=T), x) else x
}
) %>%
data.frame()
# generate a vector with variable names for all factor variables
factor_variables <-
df %>%
keep(is.factor) %>%
names
# impute missing values for factor variables
df<-
lapply(df,function(x) {
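# as.character() keeps the labels; ifelse() on a factor would return its integer codes
if(is.factor(x)) ifelse(is.na(x), "100", as.character(x)) else x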
}) %>%
data.frame()
# 100 imputation for factor variables
dt[, (dt_chr) := lapply(.SD, function(x) {
ifelse(is.na(x), "100", as.character(x))
}), .SDcols = dt_chr]
# convert factor variables back to factor variables
# (imputation turned them into character variables)
df[factor_variables] <- lapply(df[factor_variables], factor)
dt[, (dt_chr) := lapply(.SD, as.factor), .SDcols = dt_chr]
# sapply(df, function(x) sum(is.na(x)))
# table of missing values per variable
f_kbl_with_NA(dt)
| variable | value |
|---|---|
# table of blank values per variable
dt[, lapply(.SD, function(x) sum(trimws(x) == '', na.rm = TRUE))] %>%
melt.data.table(measure.vars = names(.)) %>%
.[value > 0] %>%
setorder(-value) %>%
kbl(
align = "l"
)
| variable | value |
|---|---|
| Mas Vnr Type | 23 |
| Bsmt Exposure | 4 |
| BsmtFin Type 2 | 2 |
| Garage Finish | 2 |
| Bsmt Qual | 1 |
| Bsmt Cond | 1 |
| BsmtFin Type 1 | 1 |
| Electrical | 1 |
| Garage Qual | 1 |
| Garage Cond | 1 |
# 100 imputation for factor variables
dt[, (dt_chr) := lapply(.SD, function(x) {
ifelse(x == '', "100", as.character(x))
}), .SDcols = dt_chr]
# convert factor variables back to factor variables
dt[, (dt_chr) := lapply(.SD, as.factor), .SDcols = dt_chr]
# table of blank values per variable
dt[, lapply(.SD, function(x) sum(trimws(x) == '', na.rm = TRUE))] %>%
melt.data.table(measure.vars = names(.)) %>%
.[value > 0] %>%
setorder(-value) %>%
kbl(
align = "l"
)
| variable | value |
|---|---|
dtVV <-
dt[, ..dt_chr] %>%
melt.data.table(measure.vars = dt_chr) %>%
unique()
dtVV %>%
DT::datatable(
caption = "Table 4a: Variable Values",
class = "stripe",
rownames = FALSE,
filter = "top",
extensions = c("FixedColumns"),
options = list(
scrollX = TRUE,
fixedColumns = list(leftColumns = 1)
)
) %>%
formatStyle(
"variable",
color = "#370037",
backgroundColor = "#FFFFF0",
fontWeight = "bold"
)
Explore the outcome variable (SalePrice) and how it correlates to other features
The variable “SalePrice” refers to the price at which a property was sold and hence is the variable of interest for our prediction model (“Y” or dependent variable).
Please explore Y in terms of:
Visualize the distribution of Y (e.g. use base-R “hist” or “ggplot” from the “ggplot2”-package)
Visualize the distribution of Y by looking at various subgroups
(e.g. create boxplot or scatterplot using the “ggplot2”-package).
Look at differences between neighborhoods.
Look at differences between housing style.
Draw a correlation plot to see all correlations between Y and the independent (numeric) variables.
For visualization, ggplot is frequently used as it provides a flexible way to draw a lot of different graphs.
ggplot contains two basic elements:
The initiation command:
ggplot(DATASET, aes(x=XVAR, y=YVAR, group=XVAR))
This draws a blank ggplot. Even though the x and y are specified, there are no points or lines in it.
Add the respective geom of interest. For this exercise you’ll need:
+ geom_point() (for scatterplot) or
+ geom_boxplot()
The full code to write a scatter plot would then be:
ggplot(DATASET, aes(x=XVAR, y=YVAR)) + geom_point()
To draw a correlation plot, please use the “corrplot”-package.
Using this package, one can construct a correlation plot in two steps:
Use “cor” to calculate correlation between all combinations of numeric variables
select numeric variables by using: df %>% keep(is.numeric)
Plot the calculated correlation by using the corrplot function
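The two steps can be sketched as follows. The built-in mtcars data stands in for the Ames data and mpg plays the role of Y (both are assumptions for illustration); the plotting call is left commented out because it needs the corrplot package.

```r
# mtcars (shipped with R) stands in for the Ames data; mpg plays the role of Y
num_df <- Filter(is.numeric, mtcars)

# Step 1: pairwise Pearson correlations between all numeric variables
corr_mat <- cor(num_df, use = "pairwise.complete.obs", method = "pearson")

# Correlations of every variable with the outcome, strongest positive first
corr_y <- sort(corr_mat[, "mpg"], decreasing = TRUE)
print(round(corr_y, 3))

# Step 2 (needs the corrplot package, so it is left commented out here):
# corrplot::corrplot(corr_mat, type = "upper", diag = FALSE)
```

Sorting the outcome column of the correlation matrix gives a quick ranking of candidate predictors before looking at the full plot.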
# Descriptive/summary statistics (e.g. mean, SDs, range)
dt$SalePrice %>%
psych::describe() %>%
t() %>%
as.data.table(
keep.rownames = "stat") %>%
.[, .(stat,
SalesPrice = X1)] %>%
kbl(
digits = 0,
caption = "Table 5: Descriptive statistics for Sales Price",
format.args = list(big.mark = ","),
align = 'l'
) %>%
kable_styling(
full_width = FALSE,
position = "left",
htmltable_class = "lighttable-hover lighttable-condensed lightable-striped")
| stat | SalesPrice |
|---|---|
| vars | 1 |
| n | 2,930 |
| mean | 180,796 |
| sd | 79,887 |
| median | 160,000 |
| trimmed | 170,429 |
| mad | 54,856 |
| min | 12,789 |
| max | 755,000 |
| range | 742,211 |
| skew | 2 |
| kurtosis | 5 |
| se | 1,476 |
# Visualize the distribution of Y
# (e.g. use base-R "hist" or "ggplot" from the "ggplot2"-package)
hist(dt$SalePrice)
ggplot(data = dt, aes(SalePrice)) +
geom_histogram(fill = "#005100", color = "#FFFFF0", bins = 18) +
# scale_x_continuous(limits = c(0,600000), expand = c(0, 0)) +
# scale_y_continuous(limits = c(0,650) , expand = c(0, 0)) +
labs(title = "Histogram of Sale Price") +
ylab(label = "Count") +
xlab(label = "Sale Price") +
# theme_classic() +
theme(
axis.title.x = element_text(
colour = "#370037", size = 11.5, face = "bold"),
axis.title.y = element_text(
colour = "#370037", size = 11.5, face = "bold"),
plot.title = element_text(
colour = "#370037", size = 18 , face = "bold", hjust = 0)
)
# Visualize the distribution of Y by looking at various subgroups
# (e.g. create boxplot or scatterplot using the "ggplot2"-package)
# Scatterplot
p1 <-
ggplot(data = dt, aes(x = `Lot Area`, y = SalePrice)) +
geom_point(size = .7, color = "#005100") +
scale_x_continuous(limits = c(0, 50000) , expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 600000), expand = c(0, 0)) +
labs(title = "Scatterplot Sale Price by Lot Area") +
ylab(label = "Sale Price") +
xlab(label = "Lot area") +
# theme_classic() +
theme(
axis.title.x = element_text(
colour = "#370037", size = 11.5, face = "bold"),
axis.title.y = element_text(
colour = "#370037", size = 11.5, face = "bold"),
plot.title = element_text(
colour = "#370037", size = 18 , face = "bold", hjust = 0))
# Side-by-side plots, only 1
grid.arrange(p1, nrow = 1)
Warning: Removed 20 rows containing missing values (`geom_point()`).
# Boxplot
dt[, avgSP := mean(SalePrice), by = Neighborhood] %>%
.[, Neighborhood := fct_reorder(Neighborhood, avgSP)] %>%
.[, avgSP := NULL] %>%
ggplot(aes(x = Neighborhood, y = SalePrice)) +
geom_boxplot(color = "#005100", fill = "#FFFFF0") +
labs(title = "Boxplot Sale Price by Neighbourhood") +
ylab(label = "Sale Price") +
xlab(label = "Neighbourhood") +
# theme_classic() +
theme(
axis.title.x = element_text(
colour = "#370037", size = 11.5, face = "bold"),
axis.title.y = element_text(
colour = "#370037", size = 11.5, face = "bold"),
plot.title = element_text(
colour = "#370037", size = 18 , face = "bold", hjust = 0),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
)
The box-plots are sorted by the mean of the dependent variable (SalePrice). The mean of the dependent variable is calculated for each level of the independent variable (Neighborhood above, House Style below).
The levels of the independent variable are reordered based on the mean of the dependent variable.
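The reordering step can be illustrated in base R with reorder(), which plays the same role here as fct_reorder() in the code of this document. The built-in mtcars data is an assumption for the sketch: cyl stands in for the grouping variable and mpg for SalePrice.

```r
# mtcars stands in for the Ames data: cyl plays the role of the grouping
# variable and mpg the role of SalePrice (both are assumptions for this sketch)
grp <- factor(mtcars$cyl)

# Mean of the outcome per level
grp_means <- tapply(mtcars$mpg, grp, mean)

# Reorder the factor levels by that mean (lowest mean first)
grp_sorted <- reorder(grp, mtcars$mpg, FUN = mean)

print(grp_means)
print(levels(grp_sorted))

# With the reordered factor, boxes are drawn in order of increasing mean:
# boxplot(mtcars$mpg ~ grp_sorted, xlab = "cyl", ylab = "mpg")
```

Only the level order changes; the data values themselves are untouched, so the plot shows the same boxes in a more readable sequence.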
#|label: Look at differences between housing style
dt[, avgHS := mean(SalePrice), by = `House Style`] %>%
.[, `House Style` := fct_reorder(`House Style`, avgHS)] %>%
.[, avgHS := NULL] %>%
ggplot(aes(x = `House Style`, y = SalePrice)) +
geom_boxplot(color = "#005100", fill = "#FFFFF0") +
labs(title = "Boxplot Sale Price by House Style") +
ylab(label = "Sale Price") +
xlab(label = "House Style") +
# theme_classic() +
theme(
axis.title.x = element_text(
colour = "#370037", size = 11.5, face = "bold"),
axis.title.y = element_text(
colour = "#370037", size = 11.5, face = "bold"),
plot.title = element_text(
colour = "#370037", size = 18 , face = "bold", hjust = 0),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
)
# corr_df <-
# df %>%
# keep(is.numeric) %>%
# cor
corr_dt <-
dt[, ..dt_int] %>%
cor(
use = "everything",
method = "pearson"
)
corrplot(
corr = corr_dt,
type = "upper",
title = "Correlation between all numeric variables in the dataset",
diag = FALSE,
order = 'hclust',
hclust.method = 'median',
addrect = 3,
number.font = 2,
tl.cex = 0.50,
mar = c(0, 0, 1, 0)
)
corr_dt[, "SalePrice"] %>%
as.data.table(
keep.rownames = "var",
check.names = FALSE
) %>%
setnames(".", "corr") %>%
setorder(-corr) %>%
kbl()
| var | corr |
|---|---|
| SalePrice | 1.0000000 |
| Overall Qual | 0.7992618 |
| Gr Liv Area | 0.7067799 |
| Garage Cars | 0.6478115 |
| Garage Area | 0.6403811 |
| Total Bsmt SF | 0.6321639 |
| 1st Flr SF | 0.6216761 |
| Year Built | 0.5584261 |
| Full Bath | 0.5456039 |
| Year Remod/Add | 0.5329738 |
| Garage Yr Blt | 0.5088825 |
| Mas Vnr Area | 0.5021960 |
| TotRms AbvGrd | 0.4954744 |
| Fireplaces | 0.4745581 |
| BsmtFin SF 1 | 0.4328618 |
| Lot Frontage | 0.3402558 |
| Wood Deck SF | 0.3271432 |
| Open Porch SF | 0.3129505 |
| Half Bath | 0.2850560 |
| Bsmt Full Bath | 0.2758227 |
| 2nd Flr SF | 0.2693734 |
| Lot Area | 0.2665492 |
| Bsmt Unf SF | 0.1828955 |
| Bedroom AbvGr | 0.1439134 |
| Screen Porch | 0.1121512 |
| Pool Area | 0.0684032 |
| Mo Sold | 0.0352588 |
| 3Ssn Porch | 0.0322246 |
| BsmtFin SF 2 | 0.0060176 |
| Misc Val | -0.0156915 |
| Yr Sold | -0.0305691 |
| Order | -0.0314079 |
| Bsmt Half Bath | -0.0358166 |
| Low Qual Fin SF | -0.0376598 |
| MS SubClass | -0.0850916 |
| Overall Cond | -0.1016969 |
| Kitchen AbvGr | -0.1198137 |
| Enclosed Porch | -0.1287874 |
| PID | -0.2465212 |
Now that we have a better feel for the information in the data set and have taken care of the missing values, we can start running some (additional) simple machine learning models.
We will use the “caret”-package for this exercise. Split the data randomly into a train set (70%) and a test set (30%).
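To make the idea of a random 70/30 split concrete, here is a base-R sketch that does not depend on caret. It uses the built-in mtcars data as a stand-in (an assumption); the names train_toy and test_toy are hypothetical. Unlike this plain random draw, caret::createDataPartition samples within groups of the variable it is given, so the two approaches are related but not identical.

```r
set.seed(1234)

# mtcars stands in for the Ames data (assumption for this sketch)
n <- nrow(mtcars)

# Draw 70% of the row indices at random, without replacement
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_toy <- mtcars[train_idx, ]
test_toy  <- mtcars[-train_idx, ]

print(c(train = nrow(train_toy), test = nrow(test_toy)))
```

Fixing the seed makes the split reproducible, which matters when results from different models are compared on the same test set.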
set.seed(1234)
# use the caret::createDataPartition function to split the data
Index <-
createDataPartition(dt$Order, p = 0.7, list = FALSE)
train <- dt[ Index, ]
test <- dt[-Index, ]
Next we need to specify how we want to perform the cross-validation (i.e. the optimization of the model on the train set). To this end we need to set the CV method, the number of folds, and the number of times we want to repeat the process. We will use the “repeatedcv” method with 3 repeats; the exercise calls for 10 folds, but note that the code below sets number = 5.
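What repeated k-fold cross-validation does with the training rows can be sketched in a few lines of base R. This is only an illustration of the resampling scheme, not caret's internal implementation; n, k and repeats are hypothetical values chosen to keep the output small.

```r
set.seed(1234)

n       <- 20  # hypothetical number of training rows
k       <- 5   # folds per repeat
repeats <- 3   # number of repeats

# For every repeat, shuffle the rows and deal them into k folds
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Each repeat yields k train/validation splits, so k * repeats fits in total
n_fits <- k * repeats

# Rows held out in fold 1 of repeat 1
holdout <- which(fold_ids[[1]] == 1)
print(n_fits)
print(holdout)
```

Every row is held out exactly once per repeat, and repeating the procedure with a new shuffle reduces the dependence of the performance estimate on one particular fold assignment.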
# Cross-validation strategy from the caret package
ctrl <-
trainControl(
method = "repeatedcv",
number = 5, # five folds
repeats = 3) # repeated three times
# Scatterplot with smoother lm
copy(dt[, ..dt_int]) %>%
melt.data.table(
id.vars = c("Order", "SalePrice")
) %>%
ggplot(
aes(x = value , y = SalePrice)) +
geom_point(size = .7, color = "#005100") +
geom_smooth(
method = "lm",
se = FALSE,
color = "#0000FF",
lwd = 2 ) +
facet_wrap(
ncol = 4,
facets = ~ variable,
scales = "free") +
# coord_cartesian(
# xlim = c(0, 50000),
# ylim = c(0, 600000)) +
labs(title = "Scatterplot Sale Price by Variable") +
ylab(label = "Sale Price") +
# xlab(label = "Lot area") +
# theme_classic() +
theme(
axis.title.x = element_text(
colour = "#370037", size = 11.5, face = "bold"),
axis.title.y = element_text(
colour = "#370037", size = 11.5, face = "bold"),
plot.title = element_text(
colour = "#370037", size = 18 , face = "bold", hjust = 0))
# Side-by-side plots
# grid.arrange(p2, nrow = 1)
Calculate how well the model explains the variance in the data (R²).
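As a reminder of what R² measures, the sketch below computes it by hand as one minus the ratio of the residual to the total sum of squares, together with the adjusted version, and these can be compared with the values reported by summary(). The model on the built-in mtcars data is an assumption used only to keep the example self-contained.

```r
# mtcars stands in for the training data (assumption for this sketch)
fit <- lm(mpg ~ wt + hp, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(fit)

ss_res <- sum((y - y_hat)^2)      # residual sum of squares
ss_tot <- sum((y - mean(y))^2)    # total sum of squares

r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalises the number of predictors p
n <- length(y)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```

Because R² is computed on the data the model was fitted to, a high value on the train set does not by itself say how well the model predicts unseen houses.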
# Fit the linear regression model on the training data
model <- lm(SalePrice ~ ., data = train)
# View the summary of the model
sum_mod <- summary(model)
paste(
"Multiple R-squared:", round(sum_mod$r.squared , 3),
"Adjusted R-squared:", round(sum_mod$adj.r.squared, 3)
)
[1] "Multiple R-squared: 0.949 Adjusted R-squared: 0.942"
# Extract the coefficients
coefficients <- coef(model)
# Extract the p-values for each coefficient
p_values <- summary(model)$coefficients[, "Pr(>|t|)"]
# Coefficients: (6 not defined because of singularities)
setdiff(names(coefficients), names(p_values))
[1] "`Mas Vnr Type`CBlock" "`Bsmt Cond`TA" "`BsmtFin Type 1`Unf"
[4] "`Gr Liv Area`" "`Garage Qual`TA" "`Garage Cond`TA"
# create table with the coefficients and their importance measures
data.table(
Variable = names(p_values),
Coefficient = coefficients[names(p_values)],
P_Value = p_values,
Importance = abs(coefficients[names(p_values)]) / sum(abs(coefficients[names(p_values)]))
) %>%
.[P_Value < 0.05] %>%
setorder(-P_Value) %>%
DT::datatable(
caption = "Table 6: Linear Regression model",
class = "compact stripe",
rownames = FALSE,
filter = 'top',
extensions = c('FixedColumns'),
options = list(
scrollX = TRUE,
fixedColumns = list(leftColumns = 1)
)
) %>%
formatStyle(
'Variable',
color = "#003700",
fontWeight = "bold",
backgroundColor = '#FFFFF0'
)
What does it mean when in a linear regression model you have singularities and how to solve this?
A singularity in a linear regression model means that one or more of the independent variables can be expressed as a linear combination of the other independent variables.
This is a problem because it means that the model cannot distinguish between the effects of the variables that are linearly dependent.
To address singularities caused by multicollinearity, you can take the following steps:
Identify the variables causing multicollinearity: Look for high pairwise correlations or examine variance inflation factors (VIF) to identify the variables that contribute to multicollinearity.
Resolve multicollinearity:
Remove one or more of the highly correlated variables.
Combine correlated variables to create new composite variables.
Use dimensionality reduction techniques like principal component analysis (PCA).
Assess the impact: Re-estimate the model after resolving multicollinearity and examine the changes in coefficients, standard errors, and significance levels.
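A small simulated example may help to see what a singularity looks like in practice. All variables below are made up: x3 is constructed as an exact linear combination of x1 and x2, so lm() cannot estimate it and reports NA, and alias() names the dependency.

```r
set.seed(1)

# Hypothetical predictors: x3 is an exact linear combination of x1 and x2
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + x2
y  <- 1 + 2 * x1 - x2 + rnorm(50)

# lm() cannot estimate x3 separately and reports NA for it
fit_sing <- lm(y ~ x1 + x2 + x3)
print(coef(fit_sing))

# alias() shows which terms are linear combinations of the others
print(alias(fit_sing)$Complete)

# Dropping the redundant predictor removes the singularity
fit_ok <- lm(y ~ x1 + x2)
print(coef(fit_ok))
```

In the Ames model the same thing happens with dummy variables and area columns that are exact sums or duplicates of other columns, which is why some coefficients are reported as not defined.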
Can I use AIC to determine which variables I need to use in my linear regression model?
Yes, you can use the Akaike Information Criterion (AIC) to determine which variables to include in your linear regression model. The AIC is a metric that balances the goodness of fit of a model with its complexity, penalizing models with more parameters.
The general idea is to compare the AIC values of different models with different sets of variables and select the model with the lowest AIC as the preferred model.
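We initially fit a model using all potential variables. Then, we iterate over each variable, fit a model without it, and calculate the AIC of each reduced model. The code below keeps the k variables whose reduced models have the lowest AIC. Interpret this ranking with care: a low AIC for a reduced model means the fit loses little when that variable is dropped, so the variables whose removal raises the AIC the most are the ones the model relies on most.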
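For comparison, base R already offers the leave-one-term-out comparison through drop1() and an automated backward search through step(). The sketch below uses a small model on the built-in mtcars data as a stand-in (an assumption); note that these functions use extractAIC(), whose values differ from AIC() by a constant for a given data set, so only differences between models matter.

```r
# mtcars stands in for the training data (assumption for this sketch)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# drop1() refits the model once per term, leaving that term out,
# and reports the AIC of each reduced model next to the full model
drop_tab <- drop1(full_fit)
print(drop_tab)

# step() automates the search: it keeps removing the term whose removal
# lowers AIC the most and stops when no removal improves AIC
best_fit <- step(full_fit, direction = "backward", trace = 0)
print(formula(best_fit))
```

In the drop1() table, a term whose row shows a higher AIC than the "<none>" row is one the model is better off keeping.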
# set hyperparameter k
k <- 10
# Use the lm model generated
initial_model <- model
# Calculate AIC for the initial model
initial_aic <- AIC(initial_model)
# Initialize a list to store the AIC values
aic_values <- list()
# train_df <- as.data.frame(train)
train_dt <-
copy(train) %>%
setNames(gsub(" ", "_", names(.)))
# names(train_dft <- gsub(" ", "_", names(train_dt))
# Iterate over each variable to evaluate its contribution to the model
for (var in names(train_dt)) {
# Skip the dependent variable
if (var == "SalePrice")
next
# Fit a model without the current variable
# (backticks protect names that are not syntactic, e.g. those starting with a digit)
reduced_model <- lm(formula(paste0("SalePrice ~ . - `", var, "`")), data = train_dt)
# Calculate AIC for the reduced model
aic <- AIC(reduced_model)
# Store the AIC value in the list
aic_values[[var]] <- aic
}
# Sort the AIC values in ascending order
sorted_aic <- sort(unlist(aic_values))
# Identify the variables with the lowest AIC values
selected_vars <- names(sorted_aic)[1:k]
# Build the final model using the selected variables
final_model <-
lm(SalePrice ~ ., data = train_dt[, c("SalePrice", selected_vars), with = FALSE])
lambda <- 10^seq(-3, 3, length = 100)
lassoFit <-
train(
SalePrice ~ .,
data = train,
method = "glmnet",
trControl = ctrl,
preProcess = c("center", "scale"),
tuneGrid = expand.grid(alpha = 1, lambda = lambda))
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRNn, `Roof Matl`Membran, `Roof Matl`Metal,
`Exterior 1st`ImStucc, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, ElectricalMix, FunctionalSal, FunctionalSev, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, NeighborhoodGrnHill, `Condition 2`RRAe, `Exterior
1st`AsphShn, `Exterior 1st`PreCast, `Exterior 1st`Stone, `Exterior 2nd`Other,
`Exterior 2nd`PreCast, `Exter Cond`Po, `Bsmt Qual`Po, FunctionalSal, `Misc
Feature`Elev, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRAn, `Exterior 1st`PreCast, `Exterior
2nd`Other, `Exterior 2nd`PreCast, `Mas Vnr Type`CBlock, FunctionalSal, `Misc
Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, HeatingOthW, FunctionalSal, `Pool QC`Fa, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSeWa,
UtilitiesNoSewr, NeighborhoodLandmrk, `Condition 2`PosN, `Roof Matl`Roll,
`Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior 2nd`PreCast, `Kitchen
Qual`Po, FunctionalSal, `Misc Feature`TenC, `Sale Type`VWD
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Exterior 1st`ImStucc, `Exterior 1st`PreCast, `Exterior
2nd`Other, `Exterior 2nd`PreCast, `Mas Vnr Type`CBlock, HeatingOthW,
FunctionalSal, `Misc Feature`TenC, `Sale Type`VWD
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`PosN, `Exterior 1st`PreCast, `Exterior
2nd`Other, `Exterior 2nd`PreCast, FunctionalSal, `Pool QC`Fa, `Misc
Feature`Elev, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSeWa,
UtilitiesNoSewr, NeighborhoodLandmrk, `Condition 2`RRNn, `Roof Matl`Membran,
`Exterior 1st`AsphShn, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, ElectricalMix, FunctionalSal, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Roof Matl`Metal, `Exterior 1st`PreCast, `Exterior
1st`Stone, `Exterior 2nd`Other, `Exterior 2nd`PreCast, `Exter Cond`Po, `Bsmt
Qual`Po, FunctionalSal, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, NeighborhoodGrnHill, `Condition 2`RRAe, `Condition 2`RRAn,
`Roof Matl`Roll, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, `Kitchen Qual`Po, FunctionalSal, FunctionalSev, `Misc
Feature`Gar2, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRAe, `Roof Style`Shed, `Roof Matl`Metal,
`Exterior 1st`ImStucc, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, `Mas Vnr Type`CBlock, FunctionalSal, `Pool QC`Fa, `Misc
Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSeWa,
UtilitiesNoSewr, NeighborhoodLandmrk, NeighborhoodGrnHill, `Exterior
1st`PreCast, `Exterior 2nd`Other, `Exterior 2nd`PreCast, `Exter Cond`Po,
FunctionalSal, `Misc Feature`Elev, `Misc Feature`TenC, `Sale Type`VWD
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRAn, `Roof Matl`Membran, `Roof Matl`Roll,
`Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior 2nd`PreCast, `Bsmt
Qual`Po, FunctionalSal, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`PosN, `Condition 2`RRNn, `Exterior
1st`AsphShn, `Exterior 1st`PreCast, `Exterior 1st`Stone, `Exterior 2nd`Other,
`Exterior 2nd`PreCast, FunctionalSal, FunctionalSev, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, HeatingOthW, ElectricalMix, `Kitchen Qual`Po, FunctionalSal, `Misc
Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, FunctionalSal, `Misc Feature`TenC
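The zero-variance warnings above are typical when factor predictors with very rare levels are dummy-coded: within a given resample, a dummy column for a rare level can end up constant, so it cannot be scaled. The following is a minimal base-R sketch of that mechanism on a made-up toy data frame (the names `toy`, `zone` and `fold_rows` are hypothetical and not part of the Ames analysis).

```r
# Toy data (hypothetical): the level "rare" occurs in a single row only
toy <- data.frame(
  y    = c(10, 12, 11, 15, 14, 30),
  zone = factor(c("a", "a", "b", "b", "a", "rare"))
)

# Dummy-code the factor as a formula interface would, dropping the intercept
X <- model.matrix(y ~ zone, data = toy)[, -1, drop = FALSE]

# A resample that happens to exclude the single "rare" row
fold_rows <- 1:5

# Variance of each dummy column within that resample
fold_var <- apply(X[fold_rows, , drop = FALSE], 2, var)

# Columns that are constant in this resample and hence cannot be scaled
zero_var_cols <- names(fold_var)[fold_var == 0]
print(zero_var_cols)

# One possible remedy (an option, not what the original code does):
# ask caret to drop such columns inside each resample, e.g.
#   preProcess = c("zv", "center", "scale")
```

Whether to drop these columns or to merge rare levels beforehand is a modelling choice; the sketch only shows where the warning comes from.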
lassoFit # to obtain summary of the model
glmnet
2054 samples
81 predictor
Pre-processing: centered (279), scaled (279)
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 1644, 1643, 1644, 1642, 1643, 1643, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.001000000 41936.43 0.7607009 17194.09
0.001149757 41936.43 0.7607009 17194.09
0.001321941 41936.43 0.7607009 17194.09
0.001519911 41936.43 0.7607009 17194.09
0.001747528 41936.43 0.7607009 17194.09
0.002009233 41936.43 0.7607009 17194.09
0.002310130 41936.43 0.7607009 17194.09
0.002656088 41936.43 0.7607009 17194.09
0.003053856 41936.43 0.7607009 17194.09
0.003511192 41936.43 0.7607009 17194.09
0.004037017 41936.43 0.7607009 17194.09
0.004641589 41936.43 0.7607009 17194.09
0.005336699 41936.43 0.7607009 17194.09
0.006135907 41936.43 0.7607009 17194.09
0.007054802 41936.43 0.7607009 17194.09
0.008111308 41936.43 0.7607009 17194.09
0.009326033 41936.43 0.7607009 17194.09
0.010722672 41936.43 0.7607009 17194.09
0.012328467 41936.43 0.7607009 17194.09
0.014174742 41936.43 0.7607009 17194.09
0.016297508 41936.43 0.7607009 17194.09
0.018738174 41936.43 0.7607009 17194.09
0.021544347 41936.43 0.7607009 17194.09
0.024770764 41936.43 0.7607009 17194.09
0.028480359 41936.43 0.7607009 17194.09
0.032745492 41936.43 0.7607009 17194.09
0.037649358 41936.43 0.7607009 17194.09
0.043287613 41936.43 0.7607009 17194.09
0.049770236 41936.43 0.7607009 17194.09
0.057223677 41936.43 0.7607009 17194.09
0.065793322 41936.43 0.7607009 17194.09
0.075646333 41936.43 0.7607009 17194.09
0.086974900 41936.43 0.7607009 17194.09
0.100000000 41936.43 0.7607009 17194.09
0.114975700 41936.43 0.7607009 17194.09
0.132194115 41936.43 0.7607009 17194.09
0.151991108 41936.43 0.7607009 17194.09
0.174752840 41936.43 0.7607009 17194.09
0.200923300 41936.43 0.7607009 17194.09
0.231012970 41936.43 0.7607009 17194.09
0.265608778 41936.43 0.7607009 17194.09
0.305385551 41936.43 0.7607009 17194.09
0.351119173 41936.43 0.7607009 17194.09
0.403701726 41936.43 0.7607009 17194.09
0.464158883 41936.43 0.7607009 17194.09
0.533669923 41936.43 0.7607009 17194.09
0.613590727 41936.43 0.7607009 17194.09
0.705480231 41936.43 0.7607009 17194.09
0.811130831 41936.43 0.7607009 17194.09
0.932603347 41936.43 0.7607009 17194.09
1.072267222 41936.43 0.7607009 17194.09
1.232846739 41936.43 0.7607009 17194.09
1.417474163 41936.43 0.7607009 17194.09
1.629750835 41936.43 0.7607009 17194.09
1.873817423 41936.43 0.7607009 17194.09
2.154434690 41936.43 0.7607009 17194.09
2.477076356 41936.43 0.7607009 17194.09
2.848035868 41936.43 0.7607009 17194.09
3.274549163 41936.43 0.7607009 17194.09
3.764935807 41936.43 0.7607009 17194.09
4.328761281 41936.43 0.7607009 17194.09
4.977023564 41936.43 0.7607009 17194.09
5.722367659 41936.43 0.7607009 17194.09
6.579332247 41931.42 0.7607403 17193.22
7.564633276 41875.71 0.7611846 17186.72
8.697490026 41811.16 0.7617076 17176.53
10.000000000 41729.73 0.7623733 17162.75
11.497569954 41638.96 0.7631148 17147.19
13.219411485 41535.06 0.7639686 17129.60
15.199110830 41424.41 0.7648811 17110.27
17.475284000 41299.13 0.7659093 17089.37
20.092330026 41136.09 0.7672316 17063.13
23.101297001 40927.51 0.7689158 17033.53
26.560877829 40710.69 0.7706659 17001.18
30.538555088 40477.17 0.7725514 16966.42
35.111917342 40221.78 0.7746299 16926.88
40.370172586 39932.47 0.7769898 16882.78
46.415888336 39627.35 0.7794709 16838.41
53.366992312 39308.89 0.7820395 16794.80
61.359072734 38968.04 0.7847822 16745.74
70.548023107 38584.07 0.7878782 16690.45
81.113083079 38145.18 0.7914312 16630.57
93.260334688 37765.64 0.7944114 16575.12
107.226722201 37335.83 0.7978204 16521.39
123.284673944 36813.12 0.8020534 16460.89
141.747416293 36193.39 0.8071349 16390.83
162.975083462 35512.57 0.8127587 16321.89
187.381742286 34777.46 0.8188310 16257.48
215.443469003 33997.07 0.8252792 16196.30
247.707635599 33285.87 0.8311407 16156.47
284.803586844 32564.24 0.8370304 16131.39
327.454916288 31867.12 0.8426471 16125.18
376.493580679 31237.21 0.8476845 16132.50
432.876128108 30809.27 0.8510071 16157.94
497.702356433 30652.24 0.8520907 16191.88
572.236765935 30593.71 0.8524009 16236.99
657.933224658 30562.75 0.8524742 16312.17
756.463327555 30547.32 0.8524215 16387.81
869.749002618 30541.04 0.8522912 16459.52
1000.000000000 30536.35 0.8521574 16545.32
Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 1000.
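The selected value lambda = 1000 is the largest value in the grid, and the RMSE is still decreasing slightly at that point, so the minimum may lie beyond the range that was searched. Below is a minimal sketch of a wider grid. It assumes the original grid was `10^seq(-3, 3, length.out = 100)`, which is consistent with the printed values but is not shown in this section; the names `lambda_grid_wide`, `lasso_grid` and `lassoFit_wide` are hypothetical.

```r
# Assumed original grid: 100 log-spaced values from 0.001 to 1000
lambda_grid_orig <- 10^seq(-3, 3, length.out = 100)

# Wider grid: shift the range upwards so that 1000 is no longer the boundary
lambda_grid_wide <- 10^seq(0, 5, length.out = 100)

# caret's glmnet method tunes over alpha and lambda; alpha = 1 is the lasso
lasso_grid <- expand.grid(alpha = 1, lambda = lambda_grid_wide)

print(range(lasso_grid$lambda))

# Refit with the wider grid (requires caret, glmnet, `train` and `ctrl`
# as defined earlier in the document; left commented out here):
# lassoFit_wide <- train(
#   SalePrice ~ .,
#   data = train,
#   method = "glmnet",
#   trControl = ctrl,
#   preProcess = c("center", "scale"),
#   tuneGrid = lasso_grid
# )
```

If the refit again selects the largest lambda, the grid can be extended further in the same way.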
varImp(lassoFit) # to see most important parameters
glmnet variable importance
only 20 most important variables shown (out of 279)
Overall
`Gr Liv Area` 100.00
`Overall Qual` 64.67
`Misc Feature`Elev 35.36
`Bsmt Qual`Ex 34.89
`Condition 2`PosN 27.94
NeighborhoodNridgHt 27.35
`MS SubClass` 25.70
`Bsmt Exposure`Gd 20.69
NeighborhoodStoneBr 20.39
NeighborhoodNoRidge 19.84
`Sale Type`New 18.82
`Pool QC`Gd 18.47
`Year Built` 18.28
`Mas Vnr Area` 17.13
`BsmtFin SF 1` 16.79
`Total Bsmt SF` 16.08
`Garage Cars` 15.42
`Overall Cond` 14.57
`Lot Area` 13.28
Fireplaces 12.61
plot(varImp(lassoFit)) # to plot most important parameters
## Run kNN
knnFit <-
  train(
    SalePrice ~ .,
    data = train,
    method = "knn",
    trControl = ctrl,
    preProcess = c("center", "scale")
  )
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Roof Matl`Membran, `Exterior 1st`ImStucc, `Exterior
1st`PreCast, `Exterior 2nd`Other, `Exterior 2nd`PreCast, `Bsmt Cond`Ex,
HeatingOthW, FunctionalSal, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRNn, `Exterior 1st`PreCast, `Exterior
2nd`Other, `Exterior 2nd`PreCast, `Bsmt Qual`Po, FunctionalSal, `Misc
Feature`TenC, `Sale Type`VWD
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`PosN, `Condition 2`RRAe, `Exterior
1st`AsphShn, `Exterior 1st`PreCast, `Exterior 1st`Stone, `Exterior 2nd`Other,
`Exterior 2nd`PreCast, `Exter Cond`Po, ElectricalMix, `Kitchen Qual`Po,
FunctionalSal, FunctionalSev, `Pool QC`Fa, `Misc Feature`Elev, `Misc
Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSeWa,
UtilitiesNoSewr, NeighborhoodLandmrk, `Roof Matl`Metal, `Exterior 1st`PreCast,
`Exterior 2nd`Other, `Exterior 2nd`PreCast, FunctionalSal, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, NeighborhoodGrnHill, `Condition 2`RRAn, `Roof Matl`Roll,
`Exterior 1st`CBlock, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, `Mas Vnr Type`CBlock, FunctionalSal, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, NeighborhoodGrnHill, `Exterior 1st`PreCast, `Exterior
2nd`Other, `Exterior 2nd`PreCast, `Bsmt Qual`Po, FunctionalSal, `Pool QC`TA,
`Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Roof Matl`Membran, `Exterior 1st`AsphShn, `Exterior
1st`PreCast, `Exterior 2nd`Other, `Exterior 2nd`PreCast, `Exter Cond`Po,
ElectricalMix, `Kitchen Qual`Po, FunctionalSal, FunctionalSev, `Misc
Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRNn, `Roof Matl`Metal, `Roof Matl`Roll,
`Exterior 1st`ImStucc, `Exterior 1st`PreCast, `Exterior 1st`Stone, `Exterior
2nd`Other, `Exterior 2nd`PreCast, `Mas Vnr Type`CBlock, `Bsmt Cond`Ex,
HeatingOthW, FunctionalSal, `Pool QC`Fa, `Misc Feature`Elev, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSeWa,
UtilitiesNoSewr, NeighborhoodLandmrk, `Exterior 1st`PreCast, `Exterior
2nd`Other, `Exterior 2nd`PreCast, FunctionalSal, `Misc Feature`TenC, `Sale
Type`VWD
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`PosN, `Condition 2`RRAe, `Condition 2`RRAn,
`Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior 2nd`PreCast, `Heating
QC`Po, FunctionalSal, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSeWa,
UtilitiesNoSewr, NeighborhoodLandmrk, NeighborhoodGrnHill, `Condition 2`PosA,
`Condition 2`RRAn, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, FunctionalSal, FunctionalSev, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, NeighborhoodGreens, `Condition 2`PosN, `Roof Matl`Metal,
`Exterior 1st`ImStucc, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, ElectricalMix, FunctionalSal, `Misc Feature`Elev, `Misc
Feature`TenC, `Sale Type`VWD
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRNn, `Roof Matl`Membran, `Exterior
1st`AsphShn, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior 2nd`PreCast,
FunctionalSal, `Pool QC`Fa, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Condition 2`RRAe, `Roof Matl`Roll, `Exterior 1st`PreCast,
`Exterior 2nd`Other, `Exterior 2nd`PreCast, `Exter Cond`Po, `Kitchen Qual`Po,
FunctionalSal, `Misc Feature`Gar2, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Exterior 1st`PreCast, `Exterior 1st`Stone, `Exterior
2nd`Other, `Exterior 2nd`PreCast, `Mas Vnr Type`CBlock, `Bsmt Qual`Po,
HeatingOthW, FunctionalSal, `Misc Feature`TenC
Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
10, : These variables have zero variances: `MS Zoning`I (all), UtilitiesNoSewr,
NeighborhoodLandmrk, `Exterior 1st`PreCast, `Exterior 2nd`Other, `Exterior
2nd`PreCast, FunctionalSal, `Misc Feature`TenC
knnFit # to obtain summary of the model
k-Nearest Neighbors
2054 samples
81 predictor
Pre-processing: centered (279), scaled (279)
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 1643, 1643, 1644, 1643, 1643, 1643, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 38780.89 0.7663277 25177.59
7 37898.73 0.7792425 24601.56
9 37811.90 0.7817787 24396.81
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.
plot(knnFit)
varImp(knnFit) # to see most important parameters
loess r-squared variable importance
only 20 most important variables shown (out of 81)
Overall
Overall Qual 100.00
Neighborhood 80.72
Gr Liv Area 80.08
Total Bsmt SF 74.98
Garage Area 70.58
1st Flr SF 68.65
Garage Cars 66.43
Exter Qual 66.17
Kitchen Qual 56.39
Year Built 50.35
Full Bath 47.58
Year Remod/Add 44.92
BsmtFin SF 1 42.03
Garage Yr Blt 41.42
Mas Vnr Area 41.04
TotRms AbvGrd 39.77
Bsmt Qual 35.16
Fireplaces 34.42
2nd Flr SF 31.70
PID 30.44
plot(varImp(knnFit)) # to plot most important parameters
The performance metric for the prediction model should be the root-mean-squared error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price, which makes it the root-mean-squared log error (RMSLE). Plotting a histogram of the sale price shows why the logarithm is recommended.
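The code in this section reports RMSE on the original dollar scale, so the following is a minimal base-R sketch of the RMSLE described above. The function name `rmsle` and the toy prices are hypothetical; the sketch assumes all observed and predicted prices are strictly positive, since the logarithm is undefined otherwise.

```r
# RMSLE: RMSE computed on log(predicted) versus log(actual)
rmsle <- function(actual, predicted) {
  stopifnot(
    length(actual) == length(predicted),
    all(actual > 0),
    all(predicted > 0)
  )
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Toy example (made-up prices, not taken from the Ames data)
toy_actual    <- c(120000, 250000, 400000)
toy_predicted <- c(110000, 265000, 380000)
toy_rmsle <- rmsle(actual = toy_actual, predicted = toy_predicted)
print(toy_rmsle)

# With the fitted models from this document, the call would look like:
# rmsle(actual = test$SalePrice, predicted = predict(lassoFit, newdata = test))
```

Because the error is measured on the log scale, a given relative miss counts the same for a cheap house as for an expensive one.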
# Make predictions on the test data
predictions <- predict(model, newdata = test)
# Calculate evaluation metrics (e.g., RMSE)
# Stored as test_rmse so the value does not share a name with the rmse() function used below
test_rmse <- caret::RMSE(predictions, test$SalePrice)
# LASSO
pred_lassoFit <-
  predict(lassoFit, newdata = test)
lasso_rmse <-
  rmse(
    actual = test$SalePrice,
    predicted = pred_lassoFit
  ) %>%
  round(3)
# KNN
pred_knn <-
  predict(knnFit, newdata = test)
knn_rmse <-
  rmse(
    actual = test$SalePrice,
    predicted = pred_knn
  ) %>%
  round(3)
data.table(
  Model = c("Lasso", "KNN"),
  RMSE = c(lasso_rmse, knn_rmse)
) %T>%
  setorder(RMSE) %>%
  .[, .(Rank = 1:.N, Model, RMSE)] %>%
  kbl(
    caption = "Model performance",
    align = 'l',
    centering = FALSE
  ) %>%
  kable_styling(
    full_width = FALSE,
    position = "left",
    htmltable_class = "lightable-hover lightable-condensed lightable-striped"
  )
## Appendix
data_description <-
  paste0(github_ames, "data_description.txt") %>%
  readLines()
r data_description